Mini-challenge 3


Qian Ziwei https://example.com/norajones
07-11-2021

1. Overview


1.1 Background

On the island country of Kronos, increasingly noxious effects on health and farming have been linked to the uncontrolled activity of GAStech, a natural gas operator supported by corrupt government officials. On January 20th, 2014, a corporate meeting was held to celebrate the company's newfound fortune following its initial public offering. However, a series of rare events occurred that led to the disappearance of several employees. The Protectors of Kronos (POK), a social movement organization that has been fighting water contamination and government corruption, is suspected in the disappearance.

As analysts, we were assigned several tasks to identify risks and determine how they could have been mitigated more effectively.

1.2 Literature review

1.3 Objective

Using data and visual analytics, we evaluate the changing levels of risk to the public and recommend actions for first responders:

1.4 Question 1

1.5 Question 2

1.6 Question 3

1.7 Question 4


2. Building the visualization


2.1 Setting up the environment/packages

First, we run this line of code to clear the environment and remove any existing R objects.

rm(list = ls())

This code chunk checks whether each required package is installed. If a package is missing, install.packages() installs it; library() then loads it into the current working environment.

packages = c('readr','tidytext','data.table','lubridate','ggplot2',
             'caret','dplyr','tidyr','scales','quanteda','textdata',
             'stringr','stringi','reshape2','RColorBrewer','wordcloud',
             'forcats','igraph','ggraph','widyr','clock','knitr','tidyverse',
             'DT','hms','ggiraph','topicmodels','raster','sf','maptools',
             'rgdal','ggmap','sp','tmap','tmaptools','devtools','patchwork')
for(p in packages){
  if(!require(p,character.only = TRUE)){
    install.packages(p)
  }
  library(p,character.only = TRUE)
}

2.2 Importing data and changing data type

First, use read_csv() to import the three CSV files.

read1 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1700-1830.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))
read2 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1831-2000.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))
read3 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-2001-2131.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))

Since the three CSV files share the same format, combine them with rbind().

df <- rbind.data.frame(read1,read2,read3)
glimpse(df)
Rows: 4,063
Columns: 7
$ type                   <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <chr> "20140123170000", "20140123170000", "~
$ author                 <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message                <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
DT::datatable(df,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px',targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf'))) 

In the table above, the date(yyyyMMddHHmmss) column is not in a time format, so we convert it to a date-time field. Because all activities in Mini-Challenge 3 occur on the same day, we also extract the time (hms) component without the date.

df$`date(yyyyMMddHHmmss)` <- date_time_parse(df$`date(yyyyMMddHHmmss)`,
                                 zone = "",
                                 format = "%Y%m%d %H%M%S")
df$time <- as_hms(ymd_hms((df$`date(yyyyMMddHHmmss)`)))
glimpse(df)
Rows: 4,063
Columns: 8
$ type                   <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <dttm> 2014-01-23 17:00:00, 2014-01-23 17:0~
$ author                 <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message                <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
$ time                   <time> 17:00:00, 17:00:00, 17:00:00, 17:00:~
DT::datatable(df,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px',targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

2.3 Question 1

2.3.1 Data processing

For Question 1, extract the required columns from df into a new data frame. After reading all the data carefully, the vocabulary of mbdata (microblog) and ccdata (call-center) records is very different, so the two are separated. Since ccdata consists of police and fire department records, that dataset is labeled meaningful.

df1 <- subset(df, select = c("type","author","message"))
df_m <- subset(df1, type == "mbdata")
df_cc <- subset(df1,type == "ccdata")
df_cc$condition <- "meaningful"
DT::datatable(df_cc,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px',targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
2.3.1.1 Junk message

Junk definition: after reading all the data carefully, I classified messages from the following authors, and messages matching the patterns below, as junk.

junk <- df_m %>%
  filter(str_detect(author,"KronosQuoth|Clevvah4Eva|choconibbs|trollingsnark|
                    blueSunshine|whiteprotein|FriendsOfKronos|junkman377|
                    junkman995|redisrad|roger_roger|cheapgoods998|rockinHW|
                    panopticon|dels4realz|eazymoney|cleaningFish")|
           str_detect(message,"#Grammar|RT"))
2.3.1.2 Meaningful message

Meaningful definition: after reading all the data carefully, I classified messages from the following authors and message patterns as meaningful.

meaningful <- df_m %>%
  filter(str_detect(author,"POK|AbilaPost|CentralBulletin|ourcountryyourrights|
  MindOfKronos|Viktor-E|maha_Homeland,anaregents|wordWatcher|InternationalNews|
  HomelandIlluminations|NewsOnlineToday|AbilaPoliceDepartment|KronosStar|magaMan|
  Sara_Nespola|protoGuy|SiaradSea|AbilaFire|footfingers|truthforcadau|truccotrucco|
  dangermice|trapanitweets|sofitees|brewvebeenserved|hennyhenhendrix")|
           str_detect(message,[2191 chars quoted with '"']))
2.3.1.3 Meaningless message

Meaningless definition: this group contains everything not classified as meaningful or junk.

It is obtained by subtracting the other groups from df_m with the anti_join() function.

meaningful <- dplyr::anti_join(meaningful,junk,by = c("type", "author", "message"))

combinedata <- rbind.data.frame(meaningful, junk)

meaningless <- dplyr::anti_join(df_m,combinedata,by = c("type", "author", "message"))
2.3.1.4 Combining meaningful, meaningless and junk messages

Combine the meaningful, meaningless and junk data and add a new label column.

junk$condition <- "junk"
meaningful$condition <- "meaningful"
meaningless$condition <- "meaningless"

finalq1 <- rbind.data.frame(meaningful,junk, meaningless)
DT::datatable(finalq1,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
2.3.1.5 Cleaning the dataset before tokenization

Use the stringr package to remove punctuation, @, #, < and Chinese characters from the messages. The messages in ccdata are already clean and do not require special handling.

finalq1$message <- str_replace_all(finalq1$message,'[[:punct:]]+', "")

finalq1$message <- str_replace_all(finalq1$message,fixed("@"),"")

finalq1$message <- str_replace_all(finalq1$message,fixed("#"),"")

finalq1$message <- str_replace_all(finalq1$message,fixed("<"),"")

finalq1$message <- str_replace_all(finalq1$message,"[\u4e00-\u9fa5]+", "")
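Since the same cleaning recipe is reused for several event subsets later on, it could be wrapped in a small helper function. This is only a sketch; clean_messages is a name introduced here for illustration, not part of the original code.

```r
# Hypothetical helper bundling the cleaning steps above.
# Note: the [[:punct:]] pattern already removes @, # and <, so the
# fixed() replacements are kept only to mirror the original steps.
clean_messages <- function(msg) {
  msg <- str_replace_all(msg, "[[:punct:]]+", "")
  msg <- str_replace_all(msg, fixed("@"), "")
  msg <- str_replace_all(msg, fixed("#"), "")
  msg <- str_replace_all(msg, fixed("<"), "")
  str_replace_all(msg, "[\u4e00-\u9fa5]+", "")   # Chinese characters
}

# e.g. finalq1$message <- clean_messages(finalq1$message)
```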

The messages in the table below are now clean.

DT::datatable(finalq1,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
2.3.1.6 Tokenizing the data and customizing stop words

Exclude standard stop words from the text, and use tibble() to define custom stop words selected according to the content of the messages.

tidy_m <- finalq1 %>%
  unnest_tokens(word, message) %>%
  count(condition,word,sort = TRUE)


data(stop_words)
tidy_m <- tidy_m %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
tidy_m <- tidy_m %>%
  anti_join(my_stopwords)

tidy_cc <- df_cc %>%
  unnest_tokens(word,message) %>%
  count(word, sort = TRUE)

DT::datatable(tidy_m,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
DT::datatable(tidy_cc,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:2))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

2.3.2 Simple EDA

The graph below shows the word-count distribution of the junk, meaningful and meaningless groups. Junk messages have a high repetition rate of words.

ggplot(tidy_m,aes(n,fill = condition))+
  geom_histogram(show.legend = FALSE)+
  scale_fill_brewer(palette = "Pastel1")+
  xlim(0,100)+
  facet_wrap(~condition, ncol = 2,scales = "free_y")

The graph below shows the top 15 words (by count n) of the junk, meaningful and meaningless groups.

tidy_m %>%
  group_by(condition) %>%
  slice_max(n, n= 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word,n,condition)) %>%
  ggplot(aes(x = n,
             y= word,
             fill =condition))+
  geom_col(show.legend = FALSE)+
  scale_fill_brewer(palette = "Pastel1")+
  scale_y_reordered()+
  facet_wrap(~ condition, ncol = 2,scales = "free")+
  ggtitle("mbdata") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.4))+
  labs(y = NULL)

The graph of ccdata highlights events such as "fire" and "traffic". Words like "vehicle" relate to the event "From hit-and-run accident to shooting and standoff".

tidy_cc %>%
  slice_max(n,n = 15) %>%
  ggplot(aes(x = n,
             y= reorder(word,n)))+
  geom_col(show.legend = FALSE, fill = "darkgoldenrod1")+
  ggtitle("ccdata_meaningful") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.45))+
  labs(y = NULL)

2.3.3 Visualizing the ccdata and mbdata

2.3.3.1 Wordcloud

The conclusions from the word clouds are similar to those of the simple EDA.

wordcloud_m <- finalq1 %>%
  filter(condition == "meaningful") %>%
  unnest_tokens(word, message)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word,sort = TRUE)%>%
  with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
  filter(condition == "meaningless") %>%
  unnest_tokens(word, message)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word,sort = TRUE)%>%
  with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
  filter(condition == "junk") %>%
  unnest_tokens(word, message)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word,sort = TRUE)%>%
  with(wordcloud(word,n,max.words = 100))
tidy_cc %>%
  with(wordcloud(word,n,max.words = 100))
tidy_cc %>%
  with(wordcloud(word,n,max.words = 100))

2.3.3.2 tf-idf visualization

Use bind_tf_idf() to find the words that are important to each category.

m_tf_idf <- tidy_m %>%
  bind_tf_idf(word,condition,n)

m_tf_idf %>%
  arrange(desc(tf_idf))
# A tibble: 4,111 x 6
   condition   word                           n      tf   idf  tf_idf
   <chr>       <chr>                      <int>   <dbl> <dbl>   <dbl>
 1 junk        rt                          1000 0.0558  1.10  0.0613 
 2 junk        kronosstar                   884 0.0493  0.405 0.0200 
 3 junk        homelandilluminations        183 0.0102  1.10  0.0112 
 4 junk        grammar                      157 0.00876 1.10  0.00962
 5 junk        abilapost                    330 0.0184  0.405 0.00746
 6 junk        rally                        260 0.0145  0.405 0.00588
 7 meaningless cards                          9 0.00521 1.10  0.00572
 8 meaningless easycreditkronosmorecredit     9 0.00521 1.10  0.00572
 9 meaningless nobanks                        9 0.00521 1.10  0.00572
10 meaningful  abilapost                     70 0.0137  0.405 0.00555
# ... with 4,101 more rows
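As a quick sanity check of the table above: bind_tf_idf() uses the natural-log inverse document frequency, idf = ln(number of categories / number of categories containing the word). For "rt", which occurs only in the junk category:

```r
# Reproduce the tf_idf value for "rt" from the first row of the table.
idf_rt <- log(3 / 1)   # "rt" appears in 1 of the 3 categories
tf_rt  <- 0.0558       # term frequency of "rt" within junk, from the table
tf_rt * idf_rt         # ~0.0613, matching the tf_idf column above
```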
m_tf_idf %>%
  group_by(condition) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word,tf_idf,condition)) %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = condition))+
  scale_fill_brewer(palette = "Pastel1")+
  scale_y_reordered()+
  geom_col(show.legend = FALSE)+
  facet_wrap(~condition,ncol = 2,scales = "free")+
  ggtitle("mbdata") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.4))+
  labs(y = NULL)

2.3.3.3 Bigrams visualization

Use bigrams to find the important phrases for each category.

There is actually no big difference between the three categories, but the meaningful messages contain more phrases referring to specific times and places.

meaningful_bigrams <- meaningful %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningful_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 5,775 x 2
   bigram                    n
   <chr>                 <int>
 1 viktor e                 48
 2 of the                   42
 3 dancing dolphin          41
 4 in the                   40
 5 at the                   38
 6 abila centralbulletin    30
 7 pok rally                28
 8 to the                   24
 9 dr newman                23
10 dolphin fire             20
# ... with 5,765 more rows
meaningful_separated <- meaningful_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
meaningful_filtered <- meaningful_separated %>%
  filter(!word1 %in% my_stopwords) %>%
  filter(!word2 %in% my_stopwords)
meaningful_filtered <- meaningful_filtered %>%
    filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
meaningful_counts <- meaningful_filtered %>% 
  count(word1, word2, sort = TRUE)
meaningful_graph <- meaningful_counts %>%
  filter(n > 4) %>%
  graph_from_data_frame()

set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(meaningful_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
junk_bigrams <- junk %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
junk_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 7,338 x 2
   bigram                       n
   <chr>                    <int>
 1 pokrally hi                670
 2 kronosstar pokrally        598
 3 pok rally                  234
 4 rt homelandilluminations   183
 5 rt abilapost               169
 6 rally grammar              157
 7 rt kronosstar              143
 8 if you                     126
 9 of the                     115
10 you can                    102
# ... with 7,328 more rows
junk_separated <- junk_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
junk_filtered <- junk_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
junk_counts <- junk_filtered %>% 
  count(word1, word2, sort = TRUE)
junk_graph <- junk_counts %>%
  filter(n > 50) %>%
  graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(junk_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
meaningless_bigrams <- meaningless %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningless_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 2,051 x 2
   bigram                           n
   <chr>                        <int>
 1 badprofiles.kronos tacky        12
 2 of the                          12
 3 viktor e                        10
 4 abila nobanks                    9
 5 abila pictures                   9
 6 cards get                        9
 7 credit cards                     9
 8 easy credit                      9
 9 easycredit.kronos morecredit     9
10 get what                         9
# ... with 2,041 more rows
meaningless_separated <- meaningless_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
meaningless_filtered <- meaningless_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
meaningless_counts <- meaningless_filtered %>% 
  count(word1, word2, sort = TRUE)
meaningless_graph <- meaningless_counts %>%
  filter(n > 4) %>%
  graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(meaningless_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

2.3.4 Question 1 conclusion

The three categories can be distinguished by the characteristic words used in each.

2.4 Question 2

2.4.1 Data preparation

2.4.1.1 Getting a whole meaningful data

Repeat the process above to get the full meaningful dataset.

q2_m <- subset(df, type == "mbdata")
q2_cc <- subset(df,type == "ccdata")
q2_junk <- q2_m %>%
    filter(str_detect(author,"KronosQuoth|Clevvah4Eva|choconibbs|trollingsnark|
                    blueSunshine|whiteprotein|FriendsOfKronos|junkman377|
                    junkman995|redisrad|roger_roger|cheapgoods998|rockinHW|
                    panopticon|dels4realz|eazymoney|cleaningFish")|
           str_detect(message,"#Grammar|RT"))


q2_meaningful <- q2_m %>%
  filter(str_detect(author,"POK|AbilaPost|CentralBulletin|ourcountryyourrights|
  MindOfKronos|Viktor-E|maha_Homeland,anaregents|wordWatcher|InternationalNews|
  HomelandIlluminations|NewsOnlineToday|AbilaPoliceDepartment|KronosStar|magaMan|
  Sara_Nespola|protoGuy|SiaradSea|AbilaFire|footfingers|truthforcadau|truccotrucco|
  dangermice|trapanitweets|sofitees|brewvebeenserved|hennyhenhendrix")|
           str_detect(message,[2191 chars quoted with '"']))


q2_meaningful <- dplyr::anti_join(q2_meaningful,q2_junk)
2.4.1.2 Dividing three events and processing data separately

We tried LDA on the meaningful dataset, but the results were not good, so we switched to using key words to define each event.
q2_lda <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
q2_lda <- na.omit(q2_lda)

tidy_q2 <- q2_lda %>%
  unnest_tokens(word,message)

q2_wordcount <- tidy_q2 %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_wordcount <- q2_wordcount %>%
  anti_join(my_stopwords) %>%
  count(author,word,sort = TRUE)
q2_wordcount
# A tibble: 453 x 3
   author        word             n
   <chr>         <chr>        <int>
 1 footfingers   pok             21
 2 truccotrucco  standoff        21
 3 truccotrucco  im              12
 4 truccotrucco  shooting        12
 5 truccotrucco  gelatogalore     9
 6 truccotrucco  van              9
 7 truthforcadau viktor           8
 8 footfingers   kronos           7
 9 dangermice    abilafire        6
10 footfingers   people           6
# ... with 443 more rows
q2_dtm <- q2_wordcount %>%
  cast_dfm(author,word,n)

q2_author_lda <- LDA(q2_dtm,k = 3, control = list(seed = 1234))

q2_topics <- tidy(q2_author_lda,matrix = "beta")

q2_topics
# A tibble: 1,092 x 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 pok      1.37e- 1
 2     2 pok      2.64e-82
 3     3 pok      1.84e- 2
 4     1 standoff 1.25e-81
 5     2 standoff 2.41e- 4
 6     3 standoff 7.63e- 2
 7     1 im       3.95e-82
 8     2 im       5.06e- 5
 9     3 im       4.28e- 2
10     1 shooting 1.24e-87
# ... with 1,082 more rows
q2_topics %>%
  group_by(topic) %>%
  top_n(10,beta) %>%
  ungroup() %>%
  mutate(term = reorder_within(term,beta,topic)) %>%
  ggplot(aes(beta,term,fill = topic))+
  scale_y_reordered()+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic,ncol = 2,scales = "free")+
  ggtitle("meaningful") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.4))+
  labs(y = NULL)
q2_author_lda2 <- LDA(q2_dtm,k = 4, control = list(seed = 1234))

q2_topics2 <- tidy(q2_author_lda2,matrix = "beta")

q2_topics2
# A tibble: 1,456 x 3
   topic term          beta
   <int> <chr>        <dbl>
 1     1 pok      1.37e-  1
 2     2 pok      6.08e-121
 3     3 pok      1.92e-  2
 4     4 pok      3.27e-115
 5     1 standoff 6.36e-121
 6     2 standoff 2.68e-  2
 7     3 standoff 6.71e-  2
 8     4 standoff 2.70e-104
 9     1 im       3.87e-121
10     2 im       1.34e-  2
# ... with 1,446 more rows
q2_topics2 %>%
  group_by(topic) %>%
  top_n(10,beta) %>%
  ungroup() %>%
  mutate(term = reorder_within(term,beta,topic)) %>%
  ggplot(aes(beta,term,fill = topic))+
  scale_y_reordered()+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic,ncol = 2,scales = "free")+
  ggtitle("meaningful") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.4))+
  labs(y = NULL)
q2_author_lda3 <- LDA(q2_dtm,k = 5, control = list(seed = 1234))

q2_topics3 <- tidy(q2_author_lda3,matrix = "beta")

q2_topics3
# A tibble: 1,820 x 3
   topic term          beta
   <int> <chr>        <dbl>
 1     1 pok      1.75e-  1
 2     2 pok      1.00e-159
 3     3 pok      1.92e-  2
 4     4 pok      3.64e-154
 5     5 pok      7.14e-  2
 6     1 standoff 1.46e-157
 7     2 standoff 2.68e-  2
 8     3 standoff 6.71e-  2
 9     4 standoff 1.03e-129
10     5 standoff 4.84e-154
# ... with 1,810 more rows
q2_topics3 %>%
  group_by(topic) %>%
  top_n(5,beta) %>%
  ungroup() %>%
  mutate(term = reorder_within(term,beta,topic)) %>%
  ggplot(aes(beta,term,fill = topic))+
  scale_y_reordered()+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic,ncol = 2,scales = "free")+
  ggtitle("meaningful") +
  theme(plot.title = element_text(size=10,
                                  hjust = 0.4))+
  labs(y = NULL)
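To compare the candidate topic numbers more quantitatively, one option (a sketch, not part of the original analysis) is to compute the model perplexity for each k with topicmodels::perplexity(); lower perplexity indicates a better fit to the data.

```r
# Compare LDA models with different numbers of topics by perplexity,
# reusing the q2_dtm document-feature matrix built above.
ks <- c(3, 4, 5)
perp <- sapply(ks, function(k) {
  model <- LDA(q2_dtm, k = k, control = list(seed = 1234))
  perplexity(model)
})
data.frame(k = ks, perplexity = perp)
```

In this case the keyword approach was still preferred, since the topics did not separate the three events cleanly.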

POK event

Use the important words of the POK rally event to choose the rally-relevant messages from the meaningful dataset.

q2_rally_m <- q2_meaningful %>%
  filter(str_detect(message,"pokrally|Abila City Park|Stand Up Speak Up|Sylvia Marek|Audrey McConnell Newman, Professor Lorenzo Di Stefano|Lucio Jakab|Viktor-E|Sylvia|Marek|Newman|Stefano|Di Stefano|Lucio|Jakab"))

q2_rally_cc <- q2_cc %>%
  filter(str_detect(message,"ABILA CITY PARK|CROWD"))

q2_rally <- rbind(q2_rally_m,q2_rally_cc)

DT::datatable(q2_rally,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

Clean the message content.

q2_rally$message <- str_replace_all(q2_rally$message,'[[:punct:]]+', "")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("@"),"")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("#"),"")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("<"),"")

q2_rally$message <- str_replace_all(q2_rally$message,"[\u4e00-\u9fa5]+", "")

Token the dataset.

q2_rally_tidy <- q2_rally %>%
  unnest_tokens(word, message)

data(stop_words)
q2_rally_tidy <- q2_rally_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_rally_tidy <- q2_rally_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_rally_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Fire in Dancing Dolphin

Use the important words of the "Fire in Dancing Dolphin" event to choose the fire-relevant messages from the meaningful dataset.

q2_fire_m <- q2_meaningful %>%
  filter(str_detect(message,"fire|dolphin|dancing|building|apartment|Madeg|dispatch|afd|floor|floors|fireman|firefighters|firefighter|evacuate|evacuated|evacuating|evacuation|trapped|injuries|scene|trapped|collapsed|blaze|escalated"))

q2_fire_cc <- q2_cc %>%
  filter(str_detect(message,"Fire|Crime|scene"))

q2_fire <- rbind(q2_fire_m,q2_fire_cc)

DT::datatable(q2_fire,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

Clean the message content.

q2_fire$message <- str_replace_all(q2_fire$message,'[[:punct:]]+', "")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("@"),"")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("#"),"")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("<"),"")

q2_fire$message <- str_replace_all(q2_fire$message,"[\u4e00-\u9fa5]+", "")

Token the dataset.

q2_fire_tidy <- q2_fire %>%
  unnest_tokens(word, message)

data(stop_words)
q2_fire_tidy <- q2_fire_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_fire_tidy <- q2_fire_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_fire_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
From hit-and-run accident to shooting and standoff

Use the important words of the "From hit-and-run accident to shooting and standoff" event to choose the accident-relevant messages from the meaningful dataset.

q2_accident_m <- q2_meaningful %>%
  filter(str_detect(message,"shooting|standoff|hostage|swat|negotiation|fight|arrest|hit|van| driver|bicyclist|accident|incident|bike|L829|pursuit|gun|shot|kill|dead|yelling|screaming|negotiating|negotiator|caught|over|end|shoot|shot|chasing"))

q2_accident_cc <- q2_cc %>%
  filter(str_detect(message,"van|pursuit|accident|vandalism|swat"))

q2_accident <- rbind(q2_accident_m,q2_accident_cc)

DT::datatable(q2_accident,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

Clean the message content.

q2_accident$message <- str_replace_all(q2_accident$message,'[[:punct:]]+', "")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("@"),"")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("#"),"")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("<"),"")

q2_accident$message <- str_replace_all(q2_accident$message,"[\u4e00-\u9fa5]+", "")

Token the dataset.

q2_accident_tidy <- q2_accident %>%
  unnest_tokens(word, message)

data(stop_words)
q2_accident_tidy <- q2_accident_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_accident_tidy <- q2_accident_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_accident_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

2.4.2 Visualizing different events

From processing the data, we found that there were three events in Mini-Challenge 3.

2.4.2.1 POK event

The POK rally was held at Abila City Park. About 2,000 people gathered there and a heavy police presence was deployed.
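The coord_cartesian() limits used in the histograms that follow are hms values measured in seconds since midnight, matching the time span of the data files (17:00 to 21:31):

```r
# Confirm what the numeric x-axis limits mean as times of day.
as_hms(61200)   # 17:00:00
as_hms(77460)   # 21:31:00
```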

q2_rally_tidy%>%
  filter(str_detect(word,"rally")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#99CCFF")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="rally")
q2_rally_tidy%>%
  filter(str_detect(word,"pok")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#99CCFF")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y ="pok")

q2_rally_tidy%>%
  filter(str_detect(word,"sylvia|marek")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#6699CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y ="sylvia|marek")

q2_rally_tidy%>%
  filter(str_detect(word,"lucio|jakab")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#6699CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="lucio|jakab")
q2_rally_tidy%>%
  filter(str_detect(word,"viktor")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#6699CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="viktor")
q2_rally_tidy%>%
  filter(str_detect(word,"lorenzo|di|stefano")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#6699CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="lorenzo|di|stefano")
q2_rally_tidy%>%
  filter(str_detect(word,"audrey|mcConnell|newman")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#6699CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="audrey|mcconnell|newman")
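The histogram chunks above differ only in the keyword pattern, the fill colour, and the axis label. They could be consolidated with a small helper function. The sketch below is hypothetical (not part of the original analysis); it assumes a tidy data frame with `word` and `time` columns, as produced earlier.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Hypothetical helper: draw one keyword-frequency histogram per pattern,
# matching the styling used in the chunks above.
plot_keyword_hist <- function(data, pattern, fill = "#99CCFF",
                              xlim = c(61200, 77460)) {
  data %>%
    filter(str_detect(word, pattern)) %>%
    ggplot(aes(x = time)) +
    geom_histogram(fill = fill) +
    coord_cartesian(xlim = xlim) +
    theme(panel.grid = element_blank(),
          axis.text.x = element_text(angle = 60, hjust = 1)) +
    labs(y = pattern)
}

# e.g. plot_keyword_hist(q2_rally_tidy, "rally")
#      plot_keyword_hist(q2_rally_tidy, "viktor", fill = "#6699CC")
```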

2.4.2.2 Fire in Dancing Dolphin

The fire broke out at the Dancing Dolphin apartment building.

At 21:30, AFD reported an explosion at the Dancing Dolphin.

q2_fire_tidy%>%
  filter(str_detect(word,"fire|dolphin|dancing|building|apartment")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#FFCCCC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y = NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"floor|floors|upper|resident")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#FFCCCC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"afd|police|cop|cops|fireman|firefighter")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#FFCCCC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"ambulance|injury|injuries|evacuated|evacuating|evacuation|evacuate")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#FF99CC")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"control")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#CC6699")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y = "control")
q2_fire_tidy%>%
  filter(str_detect(word,"collapsed|blaze|escalated|explosion")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#993366")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)

2.4.2.3 From hit-and-run accident to shooting and standoff

A hit-and-run accident escalated into a police pursuit, a shooting, and a hostage standoff.

q2_accident_tidy%>%
  filter(str_detect(word,"hit|run|van|bicyclist|driver|incident|accident|bike|pursuit")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#99CC99")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y = NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"gun|shoot|shot|hostage")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#00CC66")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"killed|dead")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#009900")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"standoff|negotiating|negotiate|negotiator|negotiation|yelling|screaming|chasing")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#006600")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"end|over|caught|rescued")) %>%
  ggplot(aes(x = time)) +
  geom_histogram(fill = "#003300")+
  coord_cartesian(xlim = c(61200,77460))+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)

2.5 Question 3

2.5.1 Data processing

The tourist map raster (MC2-tourist.tif) and the Abila street shapefile are first imported to serve as the basemap for the geospatial plots.

bgmap <- raster("F:/visual/assignment and project/MC3/MC3/Geospatial/MC2-tourist.tif")

bgmap
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1595, 2706, 4316070  (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05  (x, y)
extent     : 24.82419, 24.90976, 36.04499, 36.09543  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist.tif 
names      : MC2.tourist 
values     : 0, 255  (min, max)
abila_st <- st_read(dsn = "F:/visual/assignment and project/MC3/MC3/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `F:\visual\assignment and project\MC3\MC3\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
abila <- read_sf("F:/visual/assignment and project/MC3/MC3/Geospatial/Abila.shp")

q3_gps <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))

gps_m <- na.omit(q3_gps)

p <- gps_m %>%
  count(longitude,latitude)
p$n <- as.numeric(p$n)

gps_sf <- st_as_sf(p,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_point <- gps_sf %>%
  st_cast("MULTIPOINT")
q3_rally_gps <- subset(q2_rally_m,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
gps_rally_m <- na.omit(q3_rally_gps)
gps_rally_sf <- st_as_sf(gps_rally_m,
                   coords = c("longitude","latitude"),
                   crs = 4326)

q3_rally_gps <- na.omit(q3_rally_gps)

rally_count <- q3_rally_gps %>%
  count(longitude,latitude)
rally_count$n <- as.numeric(rally_count$n)

gps_rally_sf <- st_as_sf(rally_count,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_rally_point <- gps_rally_sf %>%
  st_cast("MULTIPOINT")
q3_fire_gps <- subset(q2_fire_m,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
gps_fire_m <- na.omit(q3_fire_gps)
gps_fire_sf <- st_as_sf(gps_fire_m,
                   coords = c("longitude","latitude"),
                   crs = 4326)

q3_fire_gps <- na.omit(q3_fire_gps)

fire_count <- q3_fire_gps %>%
  count(longitude,latitude)
fire_count$n <- as.numeric(fire_count$n)

gps_fire_sf <- st_as_sf(fire_count,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_fire_point <- gps_fire_sf %>%
  st_cast("MULTIPOINT")
q3_accident_gps <- subset(q2_accident_m,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
gps_accident_m <- na.omit(q3_accident_gps)
gps_accident_sf <- st_as_sf(gps_accident_m,
                   coords = c("longitude","latitude"),
                   crs = 4326)

q3_accident_gps <- na.omit(q3_accident_gps)

accident_count <- q3_accident_gps %>%
  count(longitude,latitude)
accident_count$n <- as.numeric(accident_count$n)
accident_count <- na.omit(accident_count)

gps_accident_sf <- st_as_sf(accident_count,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_accident_point <- gps_accident_sf %>%
  st_cast("MULTIPOINT")
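The subset, na.omit, count, st_as_sf, and st_cast steps are repeated verbatim for the rally, fire, and accident data. They could be collapsed into one helper. The function below is a hypothetical sketch, not part of the original analysis; it assumes a data frame with `longitude` and `latitude` columns.

```r
library(dplyr)
library(sf)

# Hypothetical helper: count messages per coordinate and return an sf
# MULTIPOINT layer in WGS 84, mirroring the repeated pipeline above.
make_gps_points <- function(df) {
  df %>%
    na.omit() %>%
    count(longitude, latitude) %>%
    mutate(n = as.numeric(n)) %>%
    st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
    st_cast("MULTIPOINT")
}

# e.g. gps_rally_point <- make_gps_points(q3_rally_gps)
```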
With the point layers prepared, each set of events is plotted on the interactive basemap with tmap.

tmap_mode("view")
tm_shape(bgmap,point.per = "feature")+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_point,is.master = TRUE, point.per = 'feature')+
  tm_dots(size ="n")
tmap_mode("view")
tm_shape(bgmap)+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_rally_point) +
  tm_dots(size = "n")
tmap_mode("view")
tm_shape(bgmap)+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_fire_point) +
  tm_dots(size = "n")
tmap_mode("view")
tm_shape(bgmap)+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_accident_point) +
  tm_dots(size = "n")